Skip to content

Benchmark multi-column GROUP BY performance#22322

Open
nathanb9 wants to merge 6 commits into
apache:mainfrom
nathanb9:benchmark-multi-group-by
Open

Benchmark multi-column GROUP BY performance#22322
nathanb9 wants to merge 6 commits into
apache:mainfrom
nathanb9:benchmark-multi-group-by

Conversation

@nathanb9
Copy link
Copy Markdown
Contributor

@nathanb9 nathanb9 commented May 18, 2026

Which issue does this PR close?

Rationale for this change

  1. Measure multi-column GROUP BY performance with a fair apples-to-apples comparison
  2. The default vectorized per-column approach (GroupValuesColumn) underperforms the row-based approach (GroupValuesRows) when distinct group count is small relative
    to input rows. The benchmark confirms this: row-based is 16-19% faster below ~200K groups, while vectorized wins by 15-33% above ~500K groups.

What changes are included in this PR?

Adds a benchmark in datafusion/physical-plan/benches/multi_group_by.rs that directly calls GroupValues::intern() with identical Int32 data for both implementations —
no SQL/planning/IO overhead, same schema, same hashing.

Makes mod row public so the benchmark can instantiate GroupValuesRows directly.

Test cases:

  • Issue In general the multi group by option seems to make certain scenarios worse. #17850 reproduction (3 cols, 64 groups, 1M-50M rows) — confirms row-based wins ~16-19%
  • Low cardinality sweep (8-4096 groups, 3-4 cols) — row-based wins 15-38%
  • Batch size sensitivity (1K-32K) — minimal effect on ratio
  • Column scaling with low groups (2-10 cols) — row-based advantage grows with columns
  • High cardinality scaling (1M groups, 2-10 cols) — vectorized wins 22-43%
  • Group count sweep (16 to 1M groups, 4 cols) — crossover at ~200K-500K groups

Are these changes tested?

  • cargo fmt --all
  • cargo clippy -p datafusion-physical-plan --bench multi_group_by -- -D warnings
  • cargo bench -p datafusion-physical-plan --bench multi_group_by

Are there any user-facing changes?

No. This adds a benchmark only.

@github-actions github-actions Bot added the core Core DataFusion crate label May 18, 2026
@nathanb9 nathanb9 marked this pull request as ready for review May 18, 2026 08:19
Copy link
Copy Markdown
Contributor

@kosiew kosiew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nathanb9
Thanks for putting this benchmark together.
I think there are a few issues that make it hard to validate the conclusions the PR is trying to draw around GroupValuesColumn vs GroupValuesRows behavior.


fn generate_parquet_file(num_cols: usize, cardinality: usize) -> NamedTempFile {
let mut rng = StdRng::seed_from_u64(42);
let fields: Vec<Field> = (0..num_cols)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now the benchmark only generates Int32 grouping columns, so DataFusion will always pick the vectorized GroupValuesColumn path for these multi-column GROUP BYs.

The PR description talks about measuring when the default per-column implementation starts to lose against the row-based GroupValuesRows implementation, but this benchmark never exercises the row-based path, so it cannot really validate the claimed crossover point.

Could we add a second benchmark variant that explicitly uses GroupValuesRows, or benchmark the two GroupValues implementations directly? That would make it possible to compare both implementations across the same cardinality and column-count combinations.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but this benchmark never exercises the row-based path, so it cannot really validate the claimed crossover point.

Yes, the way I benchmarked is by making code change to enforce one or the other and run them separately.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed. I added a new benchmar that directly instantiates both GroupValuesColumn (vectorized) and GroupValuesRows (row-based) and calls their intern() method with identical Int32 data

}

#[expect(clippy::needless_pass_by_value)]
fn query(ctx: Arc<Mutex<SessionContext>>, rt: &Runtime, sql: &str) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like each Criterion iteration calls ctx.sql(sql) before df.collect(), which means planning and logical/physical optimization are currently included in the measured path.

The module docs say the query is pre-planned and that the benchmark is only measuring execution, so this seems a bit inconsistent. It also makes the column-count experiment harder to interpret because planning cost grows with the generated projection and grouping expressions.

Could we construct the plan once during setup and only time repeated execution of that plan? Otherwise it would help to update the benchmark naming/docs to make it clear that planning is intentionally included.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of pre-planning a physical plan (which can't be re-executed due to internal state), the new benchmark bypasses SQL/planning entirely by calling GroupValues::intern() directly on pre-generated in-memory batches.

b.iter(|| query(b_ctx.ctx.clone(), &rt, &build_group_by_sql(4)))
});

let b_ctx = prepare_context(&rt, 4, 30); // 30^4 = 810K groups
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of the benchmark names/comments currently report the theoretical key-space size rather than the number of distinct groups actually produced.

With around 1M input rows, the 30^4 = 810K case only produces roughly ~574K observed groups in expectation, and the 100^4 / 500^4 cases are effectively capped near the input row count instead of producing 100M or 62B distinct groups.

Since the PR is trying to identify a distinct-group-count threshold, these labels end up being a bit misleading. Could we either generate data with controlled exact NDV per case, or rename/report these as key-space cardinalities and include the measured distinct counts alongside them?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just addresed by using sequential enumeration (global_row % num_distinct_groups decomposed per-column) which guarantees the exact number of distinct groups matches the label. No random sampling

Nathan Bezualem and others added 2 commits May 20, 2026 10:58
…g overhead, exact NDV

Addresses all three review comments from @kosiew:

1. **Implementation comparison**: Benchmarks both GroupValuesColumn (vectorized,
   via Int32 columns) and GroupValuesRows (row-based, via FixedSizeBinary(4)
   columns that trigger the fallback path) side-by-side.

2. **Execution-only timing**: Pre-optimizes the logical plan once via
   `df.into_parts()`. Each benchmark iteration only does physical planning +
   execution, excluding SQL parsing and logical optimization.

3. **Exact cardinality**: Replaces random sampling with sequential enumeration
   (`global_row % num_distinct_groups` decomposed per-column), guaranteeing
   precise distinct group counts with no birthday-paradox error.

Additionally motivated by apache#17850,
adds comprehensive experiments:
- Issue apache#17850 regression reproduction (3 cols, 64 groups, 1M-50M rows)
- Low cardinality sweep (8-4096 groups)
- Batch size sensitivity (1K-32K)
- Column count scaling (2-10 cols, low and high cardinality)
- Group count sweep (16 to 1M groups)
- Random vs sequential data patterns

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds a fair apples-to-apples benchmark that directly calls
GroupValues::intern() with identical Int32 data for both
GroupValuesColumn (vectorized) and GroupValuesRows (row-based).

This eliminates the previous confounding factors (different data types,
SQL/planning overhead) and confirms the regression reported in apache#17850:
row-based is 16-19% faster at low cardinality (64 groups), with a
crossover at ~200K-500K groups where vectorized becomes faster.

Experiments:
- Issue apache#17850 reproduction (3 cols, 64 groups, 1M-50M rows)
- Low cardinality sweep (8-4096 groups)
- Batch size sensitivity (1K-32K)
- Column count scaling (2-10 cols)
- High cardinality scaling (1M groups)
- Group count sweep (16 to 1M groups)
- Random vs sequential data patterns
@github-actions github-actions Bot added the physical-plan Changes to the physical-plan crate label May 20, 2026
pub mod multi_group_by;

mod row;
pub mod row;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Required for the benchmark to directly instantiate GroupValuesRows with the same Int32 schema used by GroupValuesColumn. Without this, benchmarks can only trigger the row-based path via an unsupported type, which makes the comparison unfair.

@github-actions github-actions Bot removed the core Core DataFusion crate label May 20, 2026
@nathanb9
Copy link
Copy Markdown
Contributor Author

Thanks for the review @kosiew.

Changes from your comments:

  1. Both GroupValuesColumn and GroupValuesRows are now benchmarked directly via the intern() API with identical Int32 data
  2. Benchmark bypasses SQL/planning entirely
  3. Exact cardinality via sequential enumeration. Labels now match actual distinct group counts.

One thing im not sure about: making mod row public. This is needed because new_group_values() with an Int32 schema will always route to GroupValuesColumn. There's no way to make it pick GroupValuesRows for Int32 without. I can make it so only thing pub is pub use row::GroupValuesRows;

@nathanb9 nathanb9 requested a review from kosiew May 20, 2026 18:51
Copy link
Copy Markdown
Contributor

@kosiew kosiew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nathanb9
Thanks for the updates here. The benchmark changes address the earlier concerns around comparing GroupValuesColumn vs GroupValuesRows directly and removing SQL planning overhead from the measured path.

I still found one issue with the 1M-group benchmark cases. Because generate_batches truncates to whole batches, the generated input for the advertised 1_000_000 row/group cases is actually smaller than the benchmark labels suggest.

.powf(1.0 / num_cols as f64)
.ceil() as usize;

let num_batches = num_rows / batch_size;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generate_batches currently computes num_batches = num_rows / batch_size, which drops the remainder rows.

With the default 8192 batch size and num_rows = 1_000_000, the benchmark generates only 122 * 8192 = 999_424 rows. That means the grp_1M / cols_*_grp_1M cases can produce at most 999_424 observed groups, not 1,000,000.

So while the sequential enumeration change fixes most of the earlier cardinality-label mismatch, the exact 1M threshold cases are still slightly off.

Could we either generate the final partial batch, or adjust/report the benchmark labels based on the actual generated row/group counts?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed now I get the remainder as well

pub mod multi_group_by;

mod row;
pub mod row;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the question about pub mod row: if the public access is only needed for the benchmark, I think it would be cleaner to keep mod row; private and expose only the narrow surface that is needed, for example pub use row::GroupValuesRows;.

That keeps the internal module layout private instead of publishing the full module path just for benchmark access.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. pub use row::GroupValuesRows;

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@nathanb9 nathanb9 requested a review from kosiew May 22, 2026 12:19
Copy link
Copy Markdown
Contributor

@kosiew kosiew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nathanb9
Thanks for the iteration.

Looks 👍 to me

@nathanb9
Copy link
Copy Markdown
Contributor Author

nathanb9 commented May 23, 2026

@kosiew Thanks for your thorough review

@Dandandan Dandandan added this pull request to the merge queue May 23, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 23, 2026
@Dandandan Dandandan added this pull request to the merge queue May 23, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants